COMPUTER SYSTEMS RESEARCH GROUP 
UNIVERSITY OF TORONTO 

Empirical Comparison of LR(k) 
and 


Precedence Parsers 


J.J. Horning and W.R. Lalonde 


TECHNICAL REPORT CSRG-1


Abstract: Knuth's LR(k) algorithm provides a more general basis for the
syntactic portion of compilers than does precedence analysis. We have
conducted experiments to determine, for practical grammars, how much this extra
generality costs in efficiency. The results indicate that the extra generality
of the LR(k) approach may often be accompanied by a reduction in table size
and an increase in parsing speed.

INTRODUCTION


We are selecting an efficient context-free parser for inclusion in a 
translator writing system (TWS). The theoretically interesting choices fall
into two general classes of bottom-up parsers: those based on precedence
analysis [e.g. 3,4,8,9,10] and those related to LR(k) analysis [e.g. 1,6,7].
Precedence techniques have been extensively used in practical systems. 
Although LR(k) "is the most general type of grammar for which there exists 
an efficient left-to-right recognizer that can be mechanically produced from 
the grammar" [2], LR(k) techniques have apparently received little use, 
probably because of "efficiency considerations." 

Direct application of the Knuth construction [6] to grammars for practical 
programming languages yields forbiddingly large tables [7]. However, DeRemer
[1] gives a construction which he claims yields parsers which are competitive
with precedence parsers in both space and time. (His basic strategy is to use
Knuth's LR(0) construction, and then add lookahead only where it is actually
required to resolve two actions.) If this claim is true, then the greater 
generality of LR(k) makes it a very attractive alternative to precedence. 

Unfortunately, the substantial differences between LR(k) and precedence 
methods make proofs of their relative efficiencies difficult, perhaps, in 
general, impossible. (A serious problem is that the theoretical bounds on 
LR(k) parser size are orders of magnitude larger than the sizes encountered in 
practice.) In the absence of a general theoretical comparison, we decided to 
perform an empirical comparison of the parsers obtained for a sample of existing
programming language grammars. This paper gives the results of that comparison.

SPACE


Before actually comparing the parsers, we had to implement the DeRemer 
constructor algorithm [1], choose a base for comparison, and select a set of
sample grammars. 

The constructor: DeRemer actually describes several variants of his 
constructor. We selected the one he calls LALR (look ahead LR) for
implementation (by WRL), and decided to include the various optimizations which
he describes. The tabular representation of the resulting parser involves a
further set of decisions. We chose to produce tables for the IBM System/360,
in the form of initialized declarations in the XPL language [8,9]; this imposed
the constraint that storage be allocated in multiples of 8 bits, but should not
drastically affect our results. More important, rather than representing the
possible state transitions by a matrix, we chose to store them as tables of
ordered pairs; this seems a more compact encoding, but has potential speed
implications (discussed later). Finally, we devoted some effort in the
constructor to combining obviously redundant entries in the tables.

The basis for comparison: We initially chose the mixed-strategy precedence 
(MSP) constructor [8,9] as the basis for comparison. It seemed a natural choice
for a number of reasons: we have an implementation (by JJH) in current use; 
we are thoroughly convinced of its usefulness and efficiency; it was the system 
we were considering replacing. 

After obtaining preliminary comparisons with MSP, we decided to expand our 
experiment to include another precedence method. Wirth-Weber simple precedence 
(WWSP) [10] is widely known and historically important. Additionally, we
could accurately estimate table size without actually implementing or running
the constructor, merely by computing the storage required for the precedence
matrix plus the productions. It was therefore easy to include WWSP in our
comparison.
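
The WWSP estimate described above is simple arithmetic. The following is a minimal sketch only, not the report's actual table layout. It assumes two bits per matrix entry (enough to encode the three precedence relations plus "no relation") and a production encoding of one byte for the left part, one byte for the length, and one byte per right-part symbol. The function name and parameters are hypothetical.

```python
import math

def wwsp_table_bytes(n_symbols, productions, bits_per_relation=2):
    """Estimate storage for a simple-precedence parser's tables.

    n_symbols: total vocabulary size (terminals + nonterminals).
    productions: list of right parts, each a list of symbols.
    bits_per_relation: ASSUMED packing of the precedence relations.
    """
    # The precedence matrix has one entry per ordered pair of symbols,
    # so its size grows with the square of the vocabulary.
    matrix_bits = n_symbols * n_symbols * bits_per_relation
    matrix_bytes = math.ceil(matrix_bits / 8)

    # ASSUMED production encoding: left part (1 byte), length (1 byte),
    # then one byte per right-part symbol.
    production_bytes = sum(2 + len(rhs) for rhs in productions)
    return matrix_bytes + production_bytes

# Hypothetical grammar: 100 symbols, 120 productions of length 3.
example_size = wwsp_table_bytes(100, [["a", "b", "c"]] * 120)
```

Under these assumptions the matrix term dominates for any realistic vocabulary, which is why the estimate can be made without running a constructor.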

The sample grammars: For each precedence method we took a grammar of
a programming language which was developed specifically for use with that
method: XPL [8,9] for MSP and EULER [10] for WWSP. Any bias in these
selections presumably favors precedence methods. Additionally, we included an
ALGOL 60 grammar, which is presumably not biased towards any particular
parsing method. (We made the modifications described in [7] to remove syntactic
ambiguities, but did not otherwise alter the grammar.)

Although this is not a large sample, we feel that it is typical of the
grammars our TWS will be required to handle (once their syntactic ambiguities
have been removed).


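The results:

                   Vocabulary size         Number of     MSP       WWSP      LALR
Grammar            Terminal  Nonterminal   Productions   bytes     bytes     bytes  states

XPL                [entries not legible in source]
EULER              [entries not legible in source]
EULER - <number>   [entries not legible in source]
ALGOL 60           [not legible]                         >6800**   >6100*    [not legible]

*  Not a WWSP grammar
** Not an MSP grammar
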


These results speak for themselves. The LALR tables are significantly
smaller than either set of precedence tables.

We actually evaluated two versions of the EULER grammar: one (exactly
as published in [10]) contains 20 productions defining <number>; the
other (EULER - <number>) has these 20 productions removed (since this
level of detail is normally relegated to the compiler's lexical scanner).

These sizes may be related to actual compilers by noting that the compiler
for XPL (XCOM) [9] requires 105,000 bytes of program and data, and that the
MSP and LALR parsers, including error recovery, require 1868 and 1416 bytes
respectively. Thus, the MSP parser and tables constitute 5% of the space of
XCOM, while the LALR parser and tables would be under 3%.


TIME


The large size of the precedence tables in the previous section is 
principally due to the use of precedence matrices which grow quadratically with 
the vocabulary size. Had we represented the state transitions by a matrix, 
the LALR tables would also have been large. Our representation saved
substantial space, but at some penalty in speed, by requiring table search,
rather than simple indexing. We wondered if the resulting parser would be
impractically slow.
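
The trade-off between the two representations can be sketched as follows. This is an illustration under stated assumptions, not the report's XPL table format: symbols and states are small integers, undefined transitions are `None`, and the pair tables are searched by binary search (the report says only that a table search is required). All function names are hypothetical.

```python
import bisect

def dense_lookup(matrix, state, symbol):
    # Matrix representation: one indexing step, but the table holds
    # n_states * n_symbols entries whether or not they are used.
    return matrix[state][symbol]

def to_pairs(matrix):
    # Keep only the defined transitions of each state as
    # (symbol, next_state) pairs, sorted by symbol.
    return [[(sym, nxt) for sym, nxt in enumerate(row) if nxt is not None]
            for row in matrix]

def pair_lookup(pairs_by_state, state, symbol, default=None):
    # Pair representation: the entry must be found by searching.
    # ASSUMPTION: binary search over the sorted pairs.
    pairs = pairs_by_state[state]
    i = bisect.bisect_left(pairs, (symbol,))
    if i < len(pairs) and pairs[i][0] == symbol:
        return pairs[i][1]
    return default

def entry_counts(matrix):
    """Return (matrix entries, pair entries) for the same transitions."""
    dense = sum(len(row) for row in matrix)
    sparse = sum(1 for row in matrix for cell in row if cell is not None)
    return dense, sparse
```

Both lookups return the same answers; the pair form stores one entry per defined transition instead of one per state-symbol combination, and pays for that with a search on every step.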

We could have substituted the new parser into an existing compiler 
(e.g., XCOM) and measured the difference in compile time; however, parsing is a 
small fraction of compilation, and we were unsure how reliably the difference 
could be observed. Instead, we abstracted just the lexical scan and parse
routines from XCOM. To separate card reading and scanning time from parsing time,
we had the program prescan an entire program at a time, storing numerical codes
for the tokens in a large array, and then parse from the array. The results
(on the IBM 360/44) for three different XPL programs are given in the following
table:


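                Size     Number of              MSP       LALR
Program         Cards    Tokens   Reductions    seconds   seconds

compactify      [row not legible in source; one timing of 0.84 survives]
XCOM            [row not legible in source; one timing of 45.35 survives]
DOSYS           [row not legible in source; one timing of 30.49 survives]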


Again, the results indicate substantially greater efficiency for the LALR 
parser. The fraction of the compiler's time required for parsing drops from 
19% to 11%. 
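
The measurement setup described earlier in this section (prescan the whole program into an array of token codes, then time only the parse over that array) can be sketched as below. This is a hypothetical harness, not the XCOM routines: the whitespace tokenizer, the `token_codes` mapping, and both function names are assumptions made for illustration.

```python
import time

def prescan(source_text, token_codes):
    # Scanning phase: turn the whole program into numeric token codes
    # up front, so that none of this work is charged to the parser.
    # ASSUMPTION: tokens are separated by whitespace.
    return [token_codes[word] for word in source_text.split()]

def time_parse(parse, tokens):
    # Parsing phase: only the parse over the in-memory array is timed.
    start = time.perf_counter()
    result = parse(tokens)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Running `time_parse` once with each parser on the same prescanned array isolates the parsing cost from reading and scanning, which is the comparison the table reports.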

It is not immediately obvious why MSP, with its fast indexing into the 
precedence matrix, is slower than LALR with its relatively slow table lookups. 
The reason is rooted in one of the basic differences between precedence and 
LR(k) methods. The precedence matrix can only be used to locate the leftmost
reducible substring in the stack; a table of right parts must still be searched
to find the applicable production. With LR(k) methods, however, the same
function that locates the reducible substring simultaneously indicates the
applicable production, without an additional table search. This saving
apparently outweighs the cost of searching the state table.
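
The difference in the reduction step can be sketched as follows. The toy grammar, the action encoding, and the function names are all hypothetical, and the linear search over right parts is an assumption about how such a table might be searched; neither function is taken from the MSP or LALR implementations.

```python
# Hypothetical toy grammar; production numbers index this table.
PRODUCTIONS = [
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("id",)),
]

def precedence_reduce(stack, handle_start):
    # Precedence parsing: the matrix has only delimited the handle
    # stack[handle_start:]; the production must still be found by
    # searching the table of right parts.
    handle = tuple(stack[handle_start:])
    for number, (lhs, rhs) in enumerate(PRODUCTIONS):
        if rhs == handle:
            return number
    raise SyntaxError("no production with right part %r" % (handle,))

def lr_reduce(action):
    # LR parsing: the action selected for the current state already
    # carries the production number, so no further search is needed.
    kind, number = action
    if kind != "reduce":
        raise ValueError("not a reduce action: %r" % (action,))
    return number
```

In the precedence case the cost of each reduction grows with the number of productions searched; in the LR case it is a single field access once the state's action is known.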


CONCLUSIONS


Comparison with precedence: The precedence methods we chose as a basis
for comparison are certainly not the only ones available, and perhaps not the
most efficient. Table size can sometimes be drastically reduced (at the cost
of delayed error detection and degraded error recovery) by replacing the
precedence matrix with precedence functions [10]. Ichbiah indicates that his
precedence-based optimized Floyd productions average six bytes of table per 
production of the grammar [5]. Our results cannot be used to prove that LALR 
is more efficient than any conceivable precedence technique; what they do show 
is that it compares very favorably in efficiency with precedence methods which 
have themselves proved to be quite acceptable in practice. We conclude that 
efficiency is not an objection to the adoption of LR(k)-based techniques. 

We are sufficiently encouraged by these results that we plan to prepare 
a version of the XPL TWS based on LALR rather than MSP parsers. If wider use 
on this campus confirms the usefulness of this approach, we will release the 
revised version through SHARE. Meanwhile, we urge that others designing
compilers or TWSs seriously consider the advantages of LALR when selecting their
parsing method. 

"In practice ... one must manipulate a grammar for an average programming 
language considerably before it is a precedence grammar ... The final grammar 
could not be presented to a programmer as a reference to the language" [2]. 


However, the only change usually required to convert a "natural" grammar to an
LALR grammar is to remove its syntactic ambiguities (which we assume is
desirable in any case). Thus the extra generality of LALR may often prove to
be a major practical advantage. (We do have some preliminary evidence that the
transformations required to produce precedence grammars [10] are also useful in
reducing the size of LALR tables, and thus may still prove useful in those
situations where space is the controlling criterion.)

Comparison with Korenjak: At this point, it is appropriate to compare 
our method and results with Korenjak's earlier attempt to improve the efficiency 
of LR(k) parsers [7]. Briefly, his approach is to manually segment a large
grammar into a number of (nearly) independent subgrammars, automatically construct
LR(1) parsers for each, and automatically merge these parsers (i.e., treat
them as "subroutines" of each other and a master parser) if certain LL(1) conditions
are met at their boundaries. Compared to LALR, this approach has two
disadvantages.

(1) The proper segmentation must be guessed at: a poor segmentation can
    cause an LR(1) grammar to be rejected.

(2) The method requires a fixed k (i.e., one) throughout the parser.
    All states are forced to look one symbol ahead, but none may look two.

By contrast, the LALR construction is entirely automatic. It determines
the acceptable "subroutines" solely on the basis of the grammar, without
requiring Korenjak's external specification. The user cannot suffer either by
specifying an unacceptable partitioning or by failing to identify all acceptable
subgrammars. For example, Korenjak's best partitioning of the ALGOL 60 grammar
produced a parser with 443 states, while the LALR constructor automatically
found one with only 376 states.

However, the principal advantage of the LALR approach is that it produces 
a local response to local ambiguities. Most parsing decisions do not require 
any lookahead at all: They can be made solely on the basis of the parse stack. 
By isolating the relatively few decisions that actually require right context 
into special "lookahead states", LALR frees all the other states from any
lookahead considerations. A local ambiguity that requires n symbols of right
context for its resolution (e.g., the := problem of [7]) merely adds n extra states
to an LALR parser, but explodes the LR(k) machine exponentially.
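
The idea of confining lookahead to a few special states can be sketched as below. The state layout (a dictionary with a `kind` field), the action tuples, and the function name are hypothetical and serve only to illustrate the principle; they are not DeRemer's table format.

```python
def next_action(state, peek):
    """Choose an action; only lookahead states inspect upcoming input.

    state: dict with a 'kind' field and table data (ASSUMED layout).
    peek: zero-argument function returning the next input symbol
          without consuming it.
    """
    if state["kind"] == "reduce":
        # The parse stack alone decides; the input is never examined.
        return ("reduce", state["production"])
    if state["kind"] == "lookahead":
        # Right context decides; this is the only place peek is called.
        return state["on"][peek()]
    # Any other state simply reads (consumes) the next symbol.
    return ("read", None)
```

Because `peek` is called only in lookahead states, resolving a conflict that needs n symbols of right context costs a chain of n such states at that one point, and every other state is unaffected.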

Finally, the advantages of segmentation, as listed in [7], all stem from
the presumed exponential growth of the number of states as the number of
productions increases. Our results in this paper indicate that, for LALR applied
to programming language grammars, the growth is linear instead. We conjecture
that our results are in fact typical of programming languages, and that little
can be gained by manual segmentation of their grammars.